We propose a novel, operational framework to formally describe the semantics of concurrent programs
running within the context of a relaxed memory model. Our framework features a "temporary
store" where the memory operations issued by the threads are recorded, in program order. A memory 
model then specifies the conditions under which a pending operation from this sequence is allowed 
to be globally performed, possibly out of order. The memory model also involves a "write grain," 
accounting for architectures where a thread may read a write that is not yet globally visible. Our 
formal model is supported by a software simulator, allowing us to run litmus tests in our semantics. 

1 Introduction 

The hardware evolution towards multicore architectures means that the most significant future perfor- 
mance gains will rely on using concurrent programming techniques at the application level. This is 
currently supported by some general purpose programming languages, such as JAVA or C/C++. The 
semantics that is assumed by the application programmer using such a concurrent language is the stan- 
dard interleaving semantics, also known as sequential consistency (SC, [11]). This is also the semantics 
assumed by most verification methods. However, it is well-known [2] that this semantics is not the one
we observe when running concurrent programs in optimizing execution environments, i.e. compilers and 
hardware architectures, which are designed to run sequential programs as fast as possible. For instance, 
let us consider the program 

    p := tt ;    ||    q := tt ;
    r0 := !q     ||    r1 := !p

where we use ML's notation !p for dereferencing the pointer (or reference, in ML's jargon) p. If the
initial state is such that the values of p and q are both ff, we cannot get, by the standard interleaving
semantics, a final state where the value of both r0 and r1 is ff. Still, running this program may, on most
multiprocessor architectures, produce this outcome. This is the case for instance on a TSO machine Q 
where the writes p := tt and q := tt are put in (distinct) buffers attached to the processors, and thus
delayed with respect to the reads ! q and ! p respectively, which get their value from the (not yet updated) 
main memory. In effect, the reads are reordered with respect to the writes. Other reordering optimizations,
which may also be introduced by compilers, yield similar failures of sequential consistency (see the
survey [2]), yet sequential consistency is generally considered as a suitable abstraction at the application
programming level. 
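
The claim that the interleaving semantics never yields ff in both registers can be checked by brute force. The following is a minimal sketch, not part of the formal development: it encodes the two threads as instruction lists (the tuple encoding and the names `THREADS`, `interleavings` and `run` are assumptions made for illustration), runs every interleaving, and collects the final values of r0 and r1, with Python's `False` and `True` standing for ff and tt.

```python
from itertools import permutations

# Each thread is a list of instructions, in program order.
# ("write", ref, value) stores a value; ("read", ref, reg) loads ref into reg.
THREADS = {
    "t0": [("write", "p", True), ("read", "q", "r0")],
    "t1": [("write", "q", True), ("read", "p", "r1")],
}

def interleavings(threads):
    """Yield every schedule, given as a sequence of thread names."""
    slots = [t for t, code in threads.items() for _ in code]
    # Duplicated permutations of equal thread names are merged by set().
    yield from set(permutations(slots))

def run(schedule, threads):
    """Execute one schedule under the interleaving (SC) semantics."""
    memory = {"p": False, "q": False}      # both references initially ff
    regs, pc = {}, {t: 0 for t in threads}
    for t in schedule:
        op, ref, arg = threads[t][pc[t]]   # next instruction of thread t
        pc[t] += 1                         # program order is preserved
        if op == "write":
            memory[ref] = arg
        else:
            regs[arg] = memory[ref]
    return regs["r0"], regs["r1"]

outcomes = {run(s, THREADS) for s in interleavings(THREADS)}
print(sorted(outcomes))
```

The pair (ff, ff) is absent from the computed outcomes, which is exactly the behavior that a TSO-like machine can nevertheless exhibit.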

Then a question is: how to ensure that concurrent programs running in a given optimized execution 
environment appear, from the programmer's point of view, to be sequentially consistent, behaving as in 
the interleaving semantics? A classical answer is: the program should not give rise to data races in its 
sequentially consistent behavior, keeping apart some specific synchronization variables, like locks. This 
is known as the "DRF (Data Race Free) guarantee," that was first stated in SOU, and has been widely 
advocated since then (see H] [111). An attractive feature of the DRF guarantee is that it allows the 
programmer to reason in terms of the standard interleaving semantics alone. However, there are still
some issues with this property. First, one would sometimes like to know what racy programs do, for 
safety reasons as in JAVA for instance, or for debugging purposes, or else for the purpose of establishing 
the validity of program transformations in a relaxed memory model. Second, the DRF guarantee is more 
an axiom, or a contract, than a guarantee: once stated that racy programs have undefined semantics, how 
do we indeed guarantee that a particular implementation provides sequentially consistent semantics for 
race free programs? 

Clearly, to address such a question, there is a preliminary problem to solve, namely: how do we 
describe the actual behavior of concurrent programs running in a relaxed execution environment? This 
is known to be a difficult problem. For instance, to the best of our knowledge, the JAVA Memory Model 
(JMM) [12] is still not sound. Moreover, its current formal description is fairly complex. In our view,
this is true also regarding the formalization of the C++ primitives for concurrent programming |H|6], or 
the formalization of the PowerPC memory model [14]. Our intention here is not to describe a specific
memory model, be it a hardware, low-level one, or the memory model for a high-level concurrent programming
language, like JAVA or C++. Our aim is rather to design a semantical framework that would be

• flexible enough to allow for the description of a wide range of memory models; 

• simple enough to support the intuition of the programmer and the implementer; 

• precise enough to support formal analysis of programs. 

(Since we are talking about programs, there will be a programming language, but the particular choice 
we make is not essential to our work.) 

To address the problem stated above, we adopt the operational style advocated in [15], which,
besides being "widely accessible to working programmers" [15], allows us to use standard techniques
to analyse and verify programs, proving properties such as the DRF guarantee [7] for instance. In Q
[13], write buffers are explicitly introduced in the semantic framework, and their behavior accounts for
some of the reorderings mentioned above. The model we propose goes beyond the simple operational 
model for write buffering, by introducing into the semantic framework a different intermediate structure, 
between the shared memory and the threads. The idea is to record in this structure the memory operations
(reads and writes, or loads and stores, in low level terminology) that are issued by the threads, in
program order. We call the sequence of pending operations issued by the threads a temporary store.
Then these operations may be delayed, and finally performed, with regard to the global shared memory, 
out of order. To be globally performed, an operation from the temporary store must be allowed to 
overtake the operations that were previously issued, that is, the operations that precede it in the temporary 
store. Then a key ingredient in our model is the commutability predicate, that characterizes, for a given 
memory model, the conditions under which an operation from the temporary store may be performed 
early. This accounts in particular for the usual relaxations of the program order, and also for the semantics 
of synchronization constructs, like barriers. 

In some relaxed memory models, some fairly complex behaviors arise that cannot be fully explained
by relaxations of the program order. These behaviors are caused by the failure of write atomicity.
To deal with this feature, we introduce another key ingredient to characterize a memory model. In our
framework, each pending write is associated with a visibility, that is, the set of threads that can see it
and can therefore read the written value. Depending on the memory model, and more specifically on
the (abstract) communication network topology between threads (or processors), not every set of threads
is a legitimate visibility. For instance, the Sequential Consistency model [11] only allows
the empty set and the singletons to be visibility sets, meaning that only the thread issuing a write can
see it before it is globally performed. Then the definition of a memory model involves, besides the
commutability predicate, a "write grain," which specifies which visibility a write is allowed to acquire.
This accounts for the fact that some threads can read others' writes early 0. Our model then easily explains,
in operational terms, the behavior of a series of "litmus tests," such as IRIW, WRC, RWC and CC
discussed in [6] for instance, and the tests from [14], designed to investigate the PowerPC architecture.
Regarding this particular memory model, we found only three cases where our formalization of the main
PowerPC barriers is more strict than the one of [14]. However, these are cases where the behavior that
our model forbids was never observed during the extensive experiments on real machines done by Sarkar 
& al. (and reported in files available on the web as a supplement to their paper). On the other hand, for 
all the litmus tests that can be expressed in our language, the behaviors that are observed in Sarkar's 
experiments on real machines are accounted for in our model, which therefore is not invalidated by these 
experimental results. Needless to say, the experimental test suite provided by Sarkar & al. was invaluable 
for us to see which behaviors the model should explain. These litmus tests were, among others, run in a 
software simulator that we have built to experiment with our semantics. 

Compared to other formalizations of relaxed semantics, our model is truly operational. By this we
mean that it consists in a set of rules that specify what can be the next step to perform, to go from one
configuration to another. This contrasts with [8] for instance, where a whole sequence of steps is only
deemed a valid behavior if it can be shown equivalent to a computation in normal order. We notice
that, again contrasting with JH, our model preserves a notion of causality: a read can only return a value
that is present in the shared memory, or that was previously written by some thread. Our notion of a
temporary store is quite similar to the "reorder box" of [13], but formulated in the standard framework of
programming language semantics. In some approaches, including [4] and J3, the various relaxations of
the order of memory operations are described by means of rewrite rules on traces of memory operations
(which again are similar to our temporary store). Notice that permuting operations in a trace is, in
general, a cyclic process. Regarding the relaxation of write atomicity, and more specifically the
read-others'-write-early capability (as illustrated by the IRIW litmus test in Section 4.2 below, where no
relaxation of the program order is involved), the only work we know that proposes a formal operational
formulation of this capability is [14] which, in our view, provides a quite complicated semantics of this
feature. We think that our formalization, by means of write visibility, is much simpler than the one of
[14]. Moreover, by relying on a concrete notion of state, our model should be more amenable to standard
programming language proof techniques, such as establishing that programs only exhibit sequentially
consistent behavior Q, or more generally to mathematical analysis and verification of programs.

Note. The web page http://www-sop.inria.fr/indes/MemoryModels/ contains a full
version of the paper. The additional contents are explained in the text. 



2 The Core Language 

Our language is a higher-order, imperative and concurrent language à la ML, that is a call-by-value
$\lambda$-calculus extended with constructs to deal with a mutable store. (This choice of a functional core language
is largely a matter of taste.) In order to simplify some technical developments, the syntax is given in
administrative normal form. In this way, only one construction, namely the application of a function to
an argument, is responsible for introducing an evaluation order (the program order). Assuming given a
set $\mathit{Var}$ of variables, ranged over by $x, y, z, \ldots$, the syntax is as follows:

$$
\begin{array}{rcll}
v & ::= & x \mid \lambda x e \mid tt \mid ff \mid () & \text{values} \\
b & \in & \mathit{Bar} & \text{barriers} \\
e \in \mathcal{L} & ::= & v \mid (v\,e) \mid (\textsf{if } v \textsf{ then } e_0 \textsf{ else } e_1) & \\
 & \mid & (\textsf{ref } v) \mid (!v) \mid (v_0 := v_1) \mid b & \text{expressions}
\end{array}
$$

As usual, the variable $x$ is bound in an expression $\lambda x e$, and we consider expressions up to $\alpha$-conversion,
that is up to the renaming of bound variables. The capture-avoiding substitution of a value $v$ for
the free occurrences of $x$ in $e$ is denoted $\{x \mapsto v\}e$. We shall use some standard abbreviations like
$(\textsf{let } x = e_0 \textsf{ in } e_1)$ for $(\lambda x e_1\, e_0)$, which is also denoted $e_0 ; e_1$ whenever $x$ does not occur free in $e_1$.
We shall sometimes (in the examples) write expressions in standard syntax, which is easily converted
to administrative form, like for instance converting $(e_0 e_1)$ into $(\textsf{let } f = e_0 \textsf{ in } (f e_1))$, or $(v := e)$ into
$(\textsf{let } x = e \textsf{ in } (v := x))$.

The barrier constructs are "no-ops" in the abstract (interleaving) semantics of the language. Such 
synchronization constructs are often considered low-level. However, we believe they can also be useful 
in a high-level concurrent programming language, for "relaxed memory aware" programming (see ll6l). 
We do not focus on a particular set $\mathit{Bar}$ here, so the language should actually be $\mathcal{L}(\mathit{Bar})$, but in the
following we shall give some examples of useful barriers, and see how to formalize their semantics. 
In the full version of this paper we also consider constructs for spawning and joining threads, and for 
locking references. 

As usual, to formalize the operational semantics of the language, we have to extend it, introducing 
some run-time values. Namely, we assume given a set $\mathit{Ref}$ of references, ranged over by $p, q, \ldots$ These
are the values returned by reference creation. In the examples we shall examine, the names suggest
that such a reference should actually be regarded as a register, which is not shared with other threads. We
still use $e$ to range not only over expressions of the source language $\mathcal{L}$, but also over expressions built
with run-time values, that is, possibly involving references. 

A step in the semantics consists in evaluating a redex inside an evaluation context. The syntax of the
latter is as follows:

$$E \;::=\; [\,] \mid (v\,E) \qquad \text{evaluation contexts}$$

As usual, we denote by $E[e]$ the run-time expression obtained by filling the hole in $E$ by $e$. The semantics
is specified as small step transitions $C \to C'$ between configurations $C$, $C'$ of the form $(S, T)$ where $S$
and $T$ are respectively the store and the thread system. To define the latter, we assume given a set $\mathit{Tid}$
of thread identifiers, ranged over by $t$. The store $S$, also called here the memory, is a mapping from a
finite set $dom(S)$ of references to values. The thread system $T$ is a mapping from a finite set $dom(T)$
of thread identifiers, subset of $\mathit{Tid}$, to run-time expressions. If $dom(T) = \{t_1, \ldots, t_n\}$ and $T(t_i) = e_i$ we
also write $T$ as

$$(t_1, e_1) \parallel \cdots \parallel (t_n, e_n)$$

The reference operational semantics, that is the standard interleaving semantics, is given in Figure 1.
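$$
\begin{array}{rcll}
(S, (t, E[(\lambda x e\, v)]) \parallel T) & \to & (S, (t, E[\{x \mapsto v\}e]) \parallel T) & \\
(S, (t, E[(\textsf{if } tt \textsf{ then } e_0 \textsf{ else } e_1)]) \parallel T) & \to & (S, (t, E[e_0]) \parallel T) & \\
(S, (t, E[(\textsf{if } ff \textsf{ then } e_0 \textsf{ else } e_1)]) \parallel T) & \to & (S, (t, E[e_1]) \parallel T) & \\
(S, (t, E[(\textsf{ref } v)]) \parallel T) & \to & (S \cup \{p \mapsto v\}, (t, E[p]) \parallel T) & \text{if } p \notin dom(S) \\
(S, (t, E[(!p)]) \parallel T) & \to & (S, (t, E[v]) \parallel T) & \text{if } S(p) = v \\
(S, (t, E[(p := v)]) \parallel T) & \to & (S[p := v], (t, E[()]) \parallel T) & \\
(S, (t, E[b]) \parallel T) & \to & (S, (t, E[()]) \parallel T) &
\end{array}
$$

Figure 1: Reference Operational Semantics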

3 Relaxed Computations 
3.1 Preliminary Definitions 

The relaxed operational semantics is formalized by means of small steps transitions

$$RC \to_{\mathcal{M}} RC'$$

between relaxed configurations $RC$ and $RC'$. The $\mathcal{M}$ parameter is the memory model. Let us first
describe the relaxed configurations. For this we need to introduce some technical ingredients. In the
relaxed semantics a read can be issued by a thread, evaluating a subexpression $(!p)$, while not immediately
returning a value. In this way the read can be overtaken by a subsequent operation. To model
this, we shall dynamically assign to each read operation a unique identifier, returned as the value read.
That is, we extend the language with names, or identifiers, to point to future values. The set $\mathit{Ident}$ of
identifiers is assumed to be disjoint from $\mathit{Var} \cup \mathit{Ref}$, and is ranged over by $\iota$. We shall use $\varrho$ to range over
$\mathit{Ref} \cup \mathit{Ident}$. The identifiers $\iota \in \mathit{Ident}$ are values in the extended language, still denoted by $v$, but notice
that $\mathit{Val}$ denotes the set of (not relaxed) values, that do not contain any identifier $\iota$. We shall require that
only true values, not relaxed ones, can be stored. It should be clear that substituting a relaxed value $v$ for
an identifier $\iota$ in an expression $e$ results in a valid expression, denoted $\{\iota \mapsto v\}e$.

Our next technical ingredient is the set $\mathit{Mop}(\mathcal{L})$ of memory operations in the language $\mathcal{L}$. These
represent the instructions that are issued by the threads, but are not necessarily immediately performed.
The set $\mathit{Mop}(\mathcal{L})$ of memory operations comprises the barriers $b \in \mathit{Bar}$ and the read and write operations,
respectively denoted $\mathsf{rd}_{\varrho,\iota}$ and $\mathsf{wr}^{W,I}_{\varrho,v}$ where $\varrho \in \mathit{Ref} \cup \mathit{Ident}$, $\iota \in \mathit{Ident}$, $W \subseteq \mathit{Tid}$ is a set of thread
names, and $I \subseteq \mathit{Ident}$ is a set of identifiers. We call the set $W$ in $\mathsf{wr}^{W,I}_{\varrho,v}$ the visibility of the write (we
comment on this, and on the $I$ component, below). Finally, we introduce operations of the form $\mathsf{rd}_{\iota}$ that
we call a read mark, meaning that a read has occurred, where $\iota$ serves as identifying the corresponding
write. That is, the syntax of memory operations is as follows:

$$\xi \in \mathit{Mop}(\mathcal{L}) \;::=\; \mathsf{rd}_{\varrho,\iota} \mid \mathsf{rd}_{\iota} \mid \mathsf{wr}^{W,I}_{\varrho,v} \mid b$$

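We can now define a relaxed configuration $RC$ as a triple $RC = (S, \sigma, T)$ where $S$ and $T$ are as above,
and $\sigma$ is a sequence of pairs $(t, \xi)$ where $t \in \mathit{Tid}$ is a thread name and $\xi \in \mathit{Mop}(\mathcal{L})$ a memory operation.
The meaning of $(t, \xi)$ in a sequence $\sigma$ is that $\xi$ is a memory operation issued by thread $t$. The sequence
$\sigma$ then records the pending memory operations issued by the threads, which will not necessarily be
performed (on the shared memory) in the order in which they appear in $\sigma$. We shall call such a $\sigma$ a
temporary store. We denote by $\Sigma_{\mathcal{L}}$ the set $\mathit{Tid} \times \mathit{Mop}(\mathcal{L})$, so that the set of temporary stores is $\Sigma_{\mathcal{L}}^*$,
the set of finite sequences over $\Sigma_{\mathcal{L}}$. We denote by $\varepsilon$ the empty sequence, and we write $\sigma \cdot \sigma'$ for the
concatenation of the two sequences $\sigma$ and $\sigma'$. We say that a relaxed configuration $(S, \sigma, T)$ is normal
whenever $\sigma = \varepsilon$, and no expression occurring in the configuration (that is, in the store $S$ or the thread
pool $T$) contains an identifier.

$$
\begin{array}{rcll}
(S, \sigma, (t, E[(\lambda x e\, v)]) \parallel T) & \to & (S, \sigma, (t, E[\{x \mapsto v\}e]) \parallel T) & \\
(S, \sigma, (t, E[(\textsf{if } tt \textsf{ then } e_0 \textsf{ else } e_1)]) \parallel T) & \to & (S, \sigma, (t, E[e_0]) \parallel T) & \\
(S, \sigma, (t, E[(\textsf{if } ff \textsf{ then } e_0 \textsf{ else } e_1)]) \parallel T) & \to & (S, \sigma, (t, E[e_1]) \parallel T) & \\
(S, \sigma, (t, E[(\textsf{ref } v)]) \parallel T) & \to & (S, \sigma \cdot (t, \mathsf{wr}^{\emptyset,\emptyset}_{p,v}), (t, E[p]) \parallel T) & p \text{ fresh} \\
(S, \sigma, (t, E[(!\varrho)]) \parallel T) & \to & (S, \sigma \cdot (t, \mathsf{rd}_{\varrho,\iota}), (t, E[\iota]) \parallel T) & \iota \text{ fresh} \\
(S, \sigma, (t, E[(\varrho := v)]) \parallel T) & \to & (S, \sigma \cdot (t, \mathsf{wr}^{\emptyset,\emptyset}_{\varrho,v}), (t, E[()]) \parallel T) & \\
(S, \sigma, (t, E[b]) \parallel T) & \to & (S, \sigma \cdot (t, b), (t, E[()]) \parallel T) &
\end{array}
$$

Figure 2: $\mathcal{M}$-Relaxed Operational Semantics (Threads)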

3.2 The Relaxed Semantics 

We present the relaxed semantics in two parts: the first one describes the evaluation of the threads, that
is, the contribution of the $T$ component in the semantics, and the second one explains how the memory
operations from the temporary store $\sigma$ are performed. One could say that the instructions executed by the
threads are "locally performed," while the operations executed from the temporary store will be "globally
performed," as their effect is made visible to the other threads. The particular memory model $\mathcal{M}$ is
irrelevant to the local evaluation of threads, and therefore in Figure 2, which presents this evaluation, we
simplify $\to_{\mathcal{M}}$ into $\to$. In the rules for reducing $(\textsf{ref } v)$ and $(!\varrho)$, "$p$ fresh" and "$\iota$ fresh" mean that $p$ and
$\iota$ do not occur in the configuration.

The relaxed semantics differs from the reference semantics in several ways. The main difference is
that the effect on the memory, if any, of evaluating the code is delayed. Namely, instead of updating the
memory, the effect of evaluating $(p := v)$, or more generally $(\varrho := v)$ where the exact reference to update
may still be undetermined, consists in recording the write operation, with a default empty visibility, at
the end of the sequence of pending memory operations. Creating a reference, reducing $(\textsf{ref } v)$, has the
same effect, once a new reference name is obtained. Reducing a dereferencing operation $(!\varrho)$ does not
immediately return a proper value, but creates and returns a fresh identifier $\iota \in \mathit{Ident}$, to be later bound
to a definite value, while appending a corresponding read operation to the temporary store. A barrier just
appends itself at the end of the temporary store. Notice that the rules of Figure 2 are not concerned with
the store $S$. As an example, considering the thread system $T$ of the example program given in the Introduction,
assigning the thread name $t_0$ to the thread on the left and $t_1$ to the one on the right, assuming $\iota_0$ and $\iota_1$
to be fresh identifiers, and executing $t_0$ followed by $t_1$ we can reach the following temporary store:

$$\sigma = (t_0, \mathsf{wr}^{\emptyset,\emptyset}_{p,tt}) \cdot (t_0, \mathsf{rd}_{q,\iota_0}) \cdot (t_1, \mathsf{wr}^{\emptyset,\emptyset}_{q,tt}) \cdot (t_1, \mathsf{rd}_{p,\iota_1})$$
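
To make the shape of a temporary store concrete, here is a minimal sketch of how the local steps for writes and reads could append operations to it. The class names `Write` and `Read`, the helpers `issue_write` and `issue_read`, and the `i0`, `i1` naming of fresh identifiers are illustrative assumptions, not part of the formal model; `True` stands for tt.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Write:
    ref: str
    value: object
    visibility: frozenset = frozenset()   # W, empty when the write is issued
    readers: frozenset = frozenset()      # I, identifiers of early reads

@dataclass(frozen=True)
class Read:
    ref: str
    ident: str                            # stands for the value to be read

_fresh = count()                          # source of fresh identifier names

def issue_write(sigma, tid, ref, value):
    """Local step for (ref := value): append a pending write."""
    return sigma + [(tid, Write(ref, value))]

def issue_read(sigma, tid, ref):
    """Local step for (!ref): append a pending read, return its identifier."""
    ident = f"i{next(_fresh)}"
    return sigma + [(tid, Read(ref, ident))], ident

# Thread t0 runs first, then t1, as in the example above.
sigma = []
sigma = issue_write(sigma, "t0", "p", True)
sigma, i0 = issue_read(sigma, "t0", "q")
sigma = issue_write(sigma, "t1", "q", True)
sigma, i1 = issue_read(sigma, "t1", "p")
```

The resulting list has the same four entries, in the same order, as the temporary store displayed above.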

A relaxed configuration $(S, \sigma, T)$ can also perform actions that originate from the temporary store $\sigma$.
These steps are performed independently from the evaluation of threads, in an asynchronous way. To
define these transitions, we need to say a bit more about the memory model $\mathcal{M}$. We shall not focus
here on a particular memory model, since our purpose is to design a general framework for describing
the semantics of concurrent programs in a relaxed setting. However, we shall make some minimal
hypotheses about the $\mathcal{M}$ parameter. But let us first say what $\mathcal{M}$ consists of. We assume that this is a
pair $\mathcal{M} = (\lhd, \mathcal{W})$ made of a commutability predicate $\lhd$ and a "write grain" $\mathcal{W}$. These two components
provide a formalization of the approach of Adve and Gharachorloo in [2], who distinguish these two key
features as the basis for categorizing memory models.

The commutability predicate delineates the relaxations of the program order that are allowed in
the weak semantics under consideration, and in particular it provides semantics for barriers. This first
component $\lhd$ of a memory model is a subset of $\Sigma_{\mathcal{L}}^* \times \Sigma_{\mathcal{L}}$, that is a binary predicate relating temporary
stores $\sigma \in \Sigma_{\mathcal{L}}^*$ with issued operations $(t, \xi) \in \Sigma_{\mathcal{L}}$. This predicate expresses which operations issued
by some thread are allowed to be performed early, that is, out of order in the relaxed semantics. Indeed,
if the temporary store is $\sigma \cdot (t, \xi) \cdot \sigma'$ with $\sigma \lhd (t, \xi)$, then the operation $\xi$ from thread $t$ may, in general, be
globally performed, as if it were the first one, and removed from the temporary store. We read $\sigma \lhd (t, \xi)$
as: $(t, \xi)$ may overtake $\sigma$, or: $\sigma$ allows $(t, \xi)$ to be performed. We assume, as an axiom satisfied by any
memory model, that the first operation in the temporary store is always allowed to execute, that is, for
any $\xi$ and $t$:

$$\varepsilon \lhd (t, \xi) \qquad \text{(E)}$$
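
As a rough illustration, a commutability predicate can be viewed as a boolean function of a prefix of the temporary store and an issued operation. The sketch below uses the predicate of the SC model defined in Section 3.3 (only the empty prefix allows an operation) as its instance; the function names and the string encoding of operations are assumptions made for the example.

```python
def sc_commutes(prefix, item):
    """SC predicate: an operation may be performed only if nothing precedes it."""
    return len(prefix) == 0

def performable(sigma, commutes):
    """Indices k such that the prefix sigma[:k] allows sigma[k] to be performed."""
    return [k for k, item in enumerate(sigma) if commutes(sigma[:k], item)]

def satisfies_axiom_E(commutes, sample_items):
    """Check axiom (E) on a finite sample: the empty store allows every item."""
    return all(commutes([], item) for item in sample_items)

# Operations are plain strings here; only their position matters for SC.
sigma = [("t0", "wr p"), ("t0", "rd q"), ("t1", "wr q")]
ready = performable(sigma, sc_commutes)
axiom_ok = satisfies_axiom_E(sc_commutes, sigma)
```

With this predicate only the head of the temporary store is ever performable, which matches the intuition that SC performs pending operations in the order in which they were issued.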

The $\mathcal{W}$ component of a memory model is a set of subsets of $\mathit{Tid}$, comprising the set of the allowed write
visibilities. In the relaxed semantics, with each write operation $\mathsf{wr}^{W,I}_{\varrho,v}$ is associated a visibility $W$, which
is a (possibly empty) set of thread identifiers. (We delay the discussion of the set $I$ to Section 3.3.)
The default visibility of a write when it is issued, as prescribed in Figure 2, is $\emptyset$, so we assume that for
any memory model this is an allowed visibility, that is $\emptyset \in \mathcal{W}$. The visibility of a write may dynamically
evolve (within $\mathcal{W}$), but we shall assume that it can only grow. The threads in $W$ see the write, while in
the temporary store, and these threads can therefore read the corresponding value, possibly before it is
globally visible (in that case the $I$ component of the write is extended). The $\mathcal{W}$ component allows us to
deal with write atomicity, or, more generally, with the extent to which the threads are allowed to read
each other's writes. In a hardware architecture, this is determined by a particular topology and behavior
of the interconnection network. Thus, for example, assuming three different threads $t$, $t'$ and $t''$, a write
$\mathsf{wr}^{\{t,t'\},I}_{p,v}$ in the temporary store can be prematurely read by threads $t$ and $t'$ but not by thread $t''$.
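
A write grain is simply a set of thread sets, so it is easy to experiment with. The sketch below builds two grains, the one used for SC in Section 3.3 and the full powerset, and encodes the side condition of the write early rule R5 of Figure 3 (the writer belongs to the new visibility, which strictly extends the old one and is allowed by the grain). The function names and the thread names are assumptions made for the example.

```python
from itertools import combinations

def sc_grain(tids):
    """Empty set and singletons: only the writer may see its pending write."""
    return {frozenset()} | {frozenset({t}) for t in tids}

def full_grain(tids):
    """All subsets of tids: any group of threads may see a pending write."""
    tids = list(tids)
    return {frozenset(c) for n in range(len(tids) + 1)
            for c in combinations(tids, n)}

def may_extend(grain, writer, old, new):
    """Side condition of R5: writer in new, old strictly inside new, new allowed."""
    return writer in new and old < new and new in grain

TIDS = {"t", "t1", "t2"}
own = frozenset({"t"})
shared = frozenset({"t", "t1"})

sc_own = may_extend(sc_grain(TIDS), "t", frozenset(), own)     # allowed
sc_shared = may_extend(sc_grain(TIDS), "t", own, shared)       # not in grain
full_shared = may_extend(full_grain(TIDS), "t", own, shared)   # allowed
```

Under the SC grain a write can only become visible to its own thread, whereas the full grain also lets it become visible to `t1` while `t2` still does not see it, as in the three-thread example above.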

We can now formulate the rules for the $\to_{\mathcal{M}}$ transitions as regards the memory. These are given in
Figure 3, with $\mathcal{M} = (\lhd, \mathcal{W})$. In the rule R2 we use a restricted commutability predicate $\lhd_{\mathit{Bar}}$
ignoring the operations from $\sigma$ that are not synchronization operations, that is:

$$\sigma \lhd_{\mathit{Bar}} (t, \xi) \;\Leftrightarrow_{\text{def}}\; \sigma{\restriction}\mathit{Bar} \lhd (t, \xi)$$

where $\sigma{\restriction}\mathit{Bar}$ is the restriction of the sequence $\sigma$ to the set $\mathit{Bar}$, that is the subsequence of $\sigma$ containing
only the issued barriers.
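
The restricted predicate is obtained from the full one by filtering the temporary store first. A minimal sketch, assuming operations are encoded as strings, that `"fence"` is the only barrier, and that the underlying predicate is the SC one (all three are illustrative assumptions):

```python
def restrict_to_barriers(sigma, is_barrier):
    """Subsequence of sigma containing only the issued barriers, in order."""
    return [(t, op) for (t, op) in sigma if is_barrier(op)]

def commutes_bar(commutes, sigma, item, is_barrier):
    """Restricted predicate: only the barriers pending in sigma can block item."""
    return commutes(restrict_to_barriers(sigma, is_barrier), item)

# Toy instance: "fence" is the only barrier, and the underlying predicate
# lets an operation proceed only when nothing is pending before it.
is_barrier = lambda op: op == "fence"
sc_commutes = lambda prefix, item: len(prefix) == 0

read = ("t0", "rd p")
past_writes = commutes_bar(sc_commutes, [("t0", "wr q")], read, is_barrier)
past_fence = commutes_bar(sc_commutes, [("t0", "fence")], read, is_barrier)
```

Here the read is blocked by the pending write under the full predicate, but not under the restricted one, which only looks at the pending fence.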

$$
\begin{array}{ll}
(S, \sigma, T) \to_{\mathcal{M}} (S, \{\iota \mapsto v\}(\sigma_0 \cdot \sigma_1, T)) & \text{R1 (read)} \\
\quad \text{if } \sigma = \sigma_0 \cdot (t, \mathsf{rd}_{p,\iota}) \cdot \sigma_1 \;\&\; \sigma_0 \lhd (t, \mathsf{rd}_{p,\iota}) \;\&\; S(p) = v & \\[1ex]
(S, \sigma, T) \to_{\mathcal{M}} (S, \{\iota \mapsto v\}(\sigma_0 \cdot (t', \mathsf{wr}^{W,I \cup \{\iota\}}_{p,v}) \cdot \sigma_1 \cdot (t, \mathsf{rd}_{\iota}) \cdot \sigma_2, T)) & \text{R2 (read early)} \\
\quad \text{if } \sigma = \sigma_0 \cdot (t', \mathsf{wr}^{W,I}_{p,v}) \cdot \sigma_1 \cdot (t, \mathsf{rd}_{p,\iota}) \cdot \sigma_2 \;\&\; t \in W \;\& & \\
\quad \phantom{\text{if }} \sigma_1 \lhd (t, \mathsf{rd}_{p,\iota}) \;\&\; \sigma_0 \lhd_{\mathit{Bar}} (t, \mathsf{rd}_{p,\iota}) & \\[1ex]
(S, \sigma, T) \to_{\mathcal{M}} (S, \sigma_0 \cdot \sigma_1, T) & \text{R3 (read)} \\
\quad \text{if } \sigma = \sigma_0 \cdot (t, \mathsf{rd}_{\iota}) \cdot \sigma_1 \;\&\; \big(\sigma_0 \lhd (t, \mathsf{rd}_{\iota}) \text{ or} & \\
\quad \phantom{\text{if }} \sigma_0 = \delta_0 \cdot (t', \mathsf{wr}^{\mathit{Tid},I \cup \{\iota\}}_{p,v}) \cdot \delta_1 \;\&\; \delta_0 \lhd (t', \mathsf{wr}^{\mathit{Tid},I \cup \{\iota\}}_{p,v})\big) & \\[1ex]
(S, \sigma, T) \to_{\mathcal{M}} (S[p := v], \sigma_0 \cdot \sigma_1, T) & \text{R4 (write)} \\
\quad \text{if } \sigma = \sigma_0 \cdot (t, \mathsf{wr}^{W,I}_{p,v}) \cdot \sigma_1 \;\&\; \sigma_0 \lhd (t, \mathsf{wr}^{W,I}_{p,v}) \;\&\; v \in \mathit{Val} & \\[1ex]
(S, \sigma, T) \to_{\mathcal{M}} (S, \sigma_0 \cdot (t, \mathsf{wr}^{W',I}_{\varrho,v}) \cdot \sigma_1, T) & \text{R5 (write early)} \\
\quad \text{if } \sigma = \sigma_0 \cdot (t, \mathsf{wr}^{W,I}_{\varrho,v}) \cdot \sigma_1 \;\&\; t \in W' \;\&\; W \subset W' \in \mathcal{W} & \\[1ex]
(S, \sigma, T) \to_{\mathcal{M}} (S, \sigma_0 \cdot \sigma_1, T) & \text{R6 (barrier)} \\
\quad \text{if } \sigma = \sigma_0 \cdot (t, b) \cdot \sigma_1 \;\&\; \sigma_0 \lhd (t, b) &
\end{array}
$$

Figure 3: $\mathcal{M}$-Relaxed Operational Semantics (Memory)

We now comment on the rules. In all cases but the early ones (R2 and R5), performing an operation from
the temporary store $\sigma$ consists in checking that the operation can be moved, up to $\lhd$, at the head of $\sigma$, and
then in removing the operation from $\sigma$ while possibly performing some effect. Namely, such an effect is
produced when the performed operation is a read or a write. The reference that is concerned by the effect
must be known in these cases. A read may also return a value if it can be moved to a corresponding,
visible write (R2). In this case, the read operation should not be blocked by barriers previously issued
but not yet globally performed. This is expressed as $\sigma_0 \lhd_{\mathit{Bar}} (t, \mathsf{rd}_{p,\iota})$. The read operation does not
completely vanish, but is transformed into a read mark $\mathsf{rd}_{\iota}$, where $\iota$ identifies the matching write. When
a read is resolved using rule R2, the identifier of the read is added to the $I$ set of the write used to serve the
read. The purpose of this set is to maintain the ordering of some memory operations, as explained below
in Section 3.3. As we shall see in Section 4.3, a read mark is only useful in relation with barriers
and can be eliminated from the temporary store as specified by R3. Notice that when we say that the
read $(t, \mathsf{rd}_{p,\iota})$ can be "moved," this is only an image: there is no transformation of the temporary store,
but only a condition on it, namely, in R1, $\sigma_0 \lhd (t, \mathsf{rd}_{p,\iota})$. In the rules R1 and R2 for read operations,
there is a global replacement of the identifier $\iota$ associated with the read by the actual value $v$ that is read:
in these rules $\{\iota \mapsto v\}(\sigma, T)$ stands for such a replacement, which does not affect the $I$ component in the
writes. (Recall that we required that an identifier such as $\iota$ cannot appear in the store.) Similarly, a write
operation $\mathsf{wr}^{W,I}_{\varrho,v}$ from the temporary store $\sigma_0 \cdot (t, \mathsf{wr}^{W,I}_{\varrho,v}) \cdot \sigma_1$ may update (rule R4) the memory when
$\varrho$ is a reference $p$, $v$ is in $\mathit{Val}$ and the write is allowed to commute with the preceding operations, that
is $\sigma_0 \lhd (t, \mathsf{wr}^{W,I}_{p,v})$. An early write action in R5 has only the effect of modifying the temporary store, by
extending the visibility of the write to more threads.
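
The effect of rules R1 and R4 can be explored on the example from the Introduction with a few lines of code. The following is a simplified sketch under stated assumptions, not the simulator mentioned in the Introduction: operations are plain tuples, visibilities and read marks are ignored (so R2, R3 and R5 never apply), the commutability predicate only forbids reordering two operations of the same thread on the same reference when one of them is a write, and the substitution of a value for an identifier is replaced by simply recording the binding.

```python
def precedes(earlier, later):
    """Same thread, same reference, and at least one of the two is a write."""
    (t0, (kind0, ref0, _)), (t1, (kind1, ref1, _)) = earlier, later
    return t0 == t1 and ref0 == ref1 and "wr" in (kind0, kind1)

def commutes(prefix, item):
    """The prefix allows item unless some pending operation precedes it."""
    return not any(precedes(earlier, item) for earlier in prefix)

def explore(store, sigma, bound=frozenset()):
    """Perform pending operations in every allowed order; collect read results."""
    if not sigma:
        return {bound}
    results = set()
    for k, (tid, (kind, ref, arg)) in enumerate(sigma):
        if not commutes(sigma[:k], sigma[k]):
            continue                       # blocked by an earlier operation
        rest = sigma[:k] + sigma[k + 1:]
        if kind == "wr":                   # R4: update the shared memory
            results |= explore({**store, ref: arg}, rest, bound)
        else:                              # R1: bind identifier to memory value
            results |= explore(store, rest, bound | {(arg, store[ref])})
    return results

# Temporary store of the example: t0 issued p := tt and !q, t1 issued q := tt and !p.
SIGMA = [("t0", ("wr", "p", True)), ("t0", ("rd", "q", "i0")),
         ("t1", ("wr", "q", True)), ("t1", ("rd", "p", "i1"))]

outcomes = {(dict(b)["i0"], dict(b)["i1"])
            for b in explore({"p": False, "q": False}, SIGMA)}
print(sorted(outcomes))
```

Because each read may overtake the pending write of its own thread (they concern different references), both identifiers can be bound to ff, the outcome that the interleaving semantics excludes.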



An obvious remark about the relaxed semantics is that it contains in a sense the interleaving semantics,
with temporary stores containing at most one operation: one can mimic a transition of the latter
either by one local step, or by a local step immediately followed by a global action. One can also
immediately see that if $\mathcal{W} = \{\emptyset\}$, then the rule R5 cannot be used, and consequently no early read can
take place. If, in addition to $\emptyset$, $\mathcal{W}$ only contains the singletons $\{t\}$ for $t \in \mathit{Tid}$, the read early rule R2 is
restricted to the "read-own-write-early" capability [2]. In the write early rule R5, the requirement $t \in W'$
means that we do not consider memory models where the "read-others'-write-early" capability would be
enabled, but not the "read-own-write-early" one (again, see [2]).



In the full version of the paper we also provide a formalization of speculative computations in our 
framework. 
3.3 Memory Models: Requirements

In the next section we briefly illustrate the expressive power of our framework for relaxed computations,
by showing some programs exhibiting behaviors that are not allowed by the reference semantics. (Many
more examples are given in the full version of this paper.) Most of these examples are standard "litmus
tests" found in the literature about memory models, that reveal in particular the consequences of relaxing
in various ways the normal order of evaluation. In most cases, the relaxations of program order can be
specified by a binary relation on $\Sigma_{\mathcal{L}}$. It is actually more convenient to use the converse relation, which
can usually be more concisely described. We call this a precedence relation. Given such a binary relation
$\mathcal{P}$ on pairs $(t, \xi) \in \Sigma_{\mathcal{L}}$, the commutability relation is supposed to satisfy

$$(t, \xi) \mathrel{\mathcal{P}} (t', \xi') \;\Rightarrow\; \forall \sigma, \sigma'.\ \neg\big(\sigma \cdot (t, \xi) \cdot \sigma' \lhd (t', \xi')\big)$$

That is, an operation in a temporary store is prevented from being globally performed by another, previously
issued one, that has precedence over it. A more positive formulation of this property is:

$$\sigma \cdot (t, \xi) \cdot \sigma' \lhd (t', \xi') \;\Rightarrow\; \neg\big((t, \xi) \mathrel{\mathcal{P}} (t', \xi')\big) \qquad (A_{\mathcal{P}})$$
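
One way to read this property is as a recipe: given a precedence relation, the most permissive predicate compatible with it allows an operation exactly when no pending operation has precedence over it. The sketch below builds that predicate and searches small temporary stores for violations of the property; the helper names, the toy universe of operations and the toy precedence (full program order within a thread) are assumptions made for the example.

```python
from itertools import product

def from_precedence(P):
    """Most permissive predicate compatible with P: no pending item precedes x."""
    return lambda sigma, x: not any(P(y, x) for y in sigma)

def find_violation(commutes, P, universe, max_len=3):
    """Search stores up to max_len for sigma allowing x although some y in sigma precedes x."""
    for n in range(1, max_len + 1):
        for sigma in product(universe, repeat=n):
            for x in universe:
                if commutes(list(sigma), x) and any(P(y, x) for y in sigma):
                    return sigma, x
    return None

# Toy precedence: every operation precedes the later ones of the same thread.
same_thread = lambda y, x: y[0] == x[0]
UNIVERSE = [("t0", "a"), ("t0", "b"), ("t1", "c")]

generated = from_precedence(same_thread)
no_violation = find_violation(generated, same_thread, UNIVERSE)
anything_goes = lambda sigma, x: True      # ignores the precedence relation
violation = find_violation(anything_goes, same_thread, UNIVERSE)
```

The generated predicate passes the bounded check, while a predicate that ignores the precedence relation is rejected on a one-element store.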

Before examining various relaxations of the program order, by way of examples, we discuss some precedence
pairs that are most often assumed in memory models. For instance, if we do not assume any
constraint as regards the commutability of writes, from the program

    (p := tt) ; (p := ff)

we could get as a possible outcome a state where the value of p in the memory is tt, by commuting
the second write before the first. This is clearly unacceptable, because this violates the semantics of
sequential programs. Then we should assume that two writes on the same reference issued by the same
thread cannot be permuted. Similarly, a write should not be overtaken by a read on the same reference
issued by the same thread, and conversely, otherwise the semantics of the sequential programs

    (p := tt) ; (r := !p)
    (r := !p) ; (p := tt)

would be violated. We shall then require that any memory model satisfies axiom $(A_{\prec})$ where $\prec$ is the
minimal precedence relation enjoying the following properties, where the free symbols are implicitly
universally quantified:

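$$
\begin{array}{rcl}
\left.\begin{array}{r}
\varrho \in \{\varrho'\} \cup \mathit{Ident} \;\& \\
t' \in W \cup \{t\} \text{ or } I \neq \emptyset \neq I'
\end{array}\right\}
& \Rightarrow &
\left\{\begin{array}{l}
(t, \mathsf{wr}^{W,I}_{\varrho',v}) \prec (t', \mathsf{rd}_{\varrho,\iota}) \;\& \\
(t, \mathsf{wr}^{W,I}_{\varrho',v}) \prec (t', \mathsf{wr}^{W',I'}_{\varrho,v'})
\end{array}\right. \\[2ex]
\iota \in I & \Rightarrow & (t, \mathsf{wr}^{W,I}_{\varrho,v}) \prec (t', \mathsf{rd}_{\iota}) \\[1ex]
\varrho \in \{\varrho'\} \cup \mathit{Ident} & \Rightarrow & (t, \mathsf{rd}_{\varrho',\iota}) \prec (t, \mathsf{wr}^{W,I}_{\varrho,v})
\end{array}
$$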

These properties ensure in particular that the precedence relations discussed above are enforced: among 
the operations of a given thread, one cannot commute for instance a read and a write on the same refer- 
ence. Notice however that it is not required that the program order is maintained as regards two reads on 
the same reference. Therefore, from the program 
if initially $S(p) = ff$, we could end up in a state where the value of r1 is ff, while the one for r0 is tt. If
one wishes to preclude such a behavior, one can simply add

$$\varrho \in \{p\} \cup \mathit{Ident} \;\Rightarrow\; (t, \mathsf{rd}_{\varrho,\iota}) \mathrel{\mathcal{P}} (t, \mathsf{rd}_{p,\iota'})$$

to the precedence relation.

There are three cases where the $\prec$ precedence relates two distinct threads. The first one, that is
$(t, \mathsf{wr}^{W,I}_{p,v}) \prec (t', \mathsf{rd}_{p,\iota})$ where $t' \in W$, means that a thread $t'$ "sees" the writes, previously issued by other
threads, that include $t'$ in their scope; the same holds with $(t, \mathsf{wr}^{W,I}_{p,v}) \prec (t', \mathsf{wr}^{W',I'}_{p,v'})$ where $t' \in W$. The
precedence $(t, \mathsf{wr}^{W,I}_{p,v}) \prec (t', \mathsf{wr}^{W',I'}_{p,v'})$ where $I \neq \emptyset \neq I'$ means that the order of writes on a given reference
must be respected if these writes have been read by some threads (this is similar to the "coherence
order" of [14]). Finally, $(t, \mathsf{wr}^{W,I}_{\varrho,v}) \prec (t', \mathsf{rd}_{\iota})$ where $\iota \in I$ means that an early read cannot vanish from
the temporary store before the corresponding write. These two properties explain the role of the $I$
component in our model. One should notice that no specific precedence assumption is made at this point
regarding the barriers. Then our definition of the notion of a memory model is as follows:

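Definition (Memory Models) 3.1. A memory model $\mathcal{M}$ for $\mathcal{L}$ is a pair $(\curlyvee, \mathcal{W})$ where
$\emptyset \in \mathcal{W}$, and the commutability predicate $\curlyvee \subseteq \mathcal{D}_{\mathcal{L}}^{*} \times \mathcal{E}_{\mathcal{L}}$ satisfies the axioms (E) and $(A_\prec)$.

As an example memory model, one can define SC, for Sequential Consistency, as

\[ SC = (\{\varepsilon\} \times \mathcal{E}_{\mathcal{L}},\ \{\emptyset\} \cup \{\{t\} \mid t \in \mathit{Tid}\}) \]

which obviously satisfies Definition 3.1 (the axiom $(A_\prec)$ is vacuously true). All the examples discussed
in the following section hold in the minimal, or most relaxed, memory model
$\mathcal{M}_\prec(\mathcal{L}) = (\curlyvee_\prec, 2^{\mathit{Tid}})$, where $\curlyvee_\prec$ is the largest commutability predicate satisfying $(A_\prec)$,
and $2^{\mathit{Tid}}$ is the set of all subsets of Tid.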

In our work we mainly use commutability properties that are generated by precedence relations, in
the sense of axiom $(A_{\mathcal{P}})$. Then one could think of defining a memory model as a pair $(\mathcal{P}, \mathcal{W})$, instead of
$(\curlyvee, \mathcal{W})$. However, we shall see in Section 4.3 a case where this is not general enough. More precisely,
we shall see a case where we have to say that $\neg(\sigma \curlyvee (t,\xi))$ not on the basis that $\sigma$ contains an
operation that has precedence over $(t,\xi)$, but because there is a subsequence of $\sigma$ which, as a whole,
has precedence over it.

4 Examples 

Now we examine a few examples of programs exhibiting relaxed behaviors that are not allowed by the
reference semantics. (As mentioned above, in the full version of this paper we examine many more
examples.) In all the examples we assume that the initial values of the references are ff. We shall omit
the superscript $W$ in $(t, \mathsf{wr}^{W,I}_{p,v})$ whenever $W = \emptyset$, and similarly for $I$.

4.1 Simple Relaxations 

Let us start with the most common relaxation, the one of the W→R order [2], supported by simple write
buffering as in TSO machines. That is, we are assuming that a write $(t, \mathsf{wr}^{W,I}_{p,v})$ does not have
precedence over a read $(t, \mathsf{rd}_{q,\iota})$ if $p \neq q$. The litmus test here is the thread system T of the example
given in the Introduction (with the obvious assignment of thread names). If we let

\[ \sigma = (t_0, \mathsf{wr}_{p,tt}) \cdot (t_0, \mathsf{rd}_{q,\iota_0}) \cdot (t_1, \mathsf{wr}_{q,tt}) \cdot (t_1, \mathsf{rd}_{p,\iota_1}) \]

we have

\[ (S, \varepsilon, T) \xrightarrow{*} (S, \sigma, (t_0, r_0 := \iota_0) \parallel (t_1, r_1 := \iota_1)) \]

It is then easy to see that, given that the order W→R may be relaxed, we have

\[ (S, \varepsilon, T) \xrightarrow{*} (S, \sigma', (t_0, r_0 := ff) \parallel (t_1, r_1 := ff)) \]

where $\sigma' = (t_0, \mathsf{wr}_{p,tt}) \cdot (t_1, \mathsf{wr}_{q,tt})$. These write operations can now be executed, and we reach a final
state $(S', \varepsilon, T')$ where $S'(p) = tt = S'(q)$ and $S'(r_0) = ff = S'(r_1)$.

To restore SC behavior in a relaxed memory model, the language must offer appropriate synchronization
means. Most often these are barriers, that disallow some relaxations, when inserted between memory
operations. For instance, to forbid the W→R relaxation, a natural barrier to use is (wr) (write/read),
which cannot overtake a write, and cannot be overtaken by a read from the same thread. In our
framework, the semantics of barriers are specified by the commutability predicate: they have no other
effect than preventing some reorderings. In the case of (wr), we require that the commutability predicate
satisfies $(A_{\mathcal{P}_{(wr)}})$ for a precedence relation $\mathcal{P}_{(wr)}$ such that

\[ (t, \mathsf{wr}^{W,I}_{p,v}) \;\mathcal{P}_{(wr)}\; (t, (\mathsf{wr})) \;\mathcal{P}_{(wr)}\; (t, \mathsf{rd}_{q,\iota}) \]

(We do not have to specify that (wr) has precedence over $\mathsf{rd}_{\iota}$, because, due to the conditions in R2, a read
mark is never preceded by a read barrier in the temporary store.) This is a local barrier since it blocks
only operations from the thread that issued it. Then for restoring an SC behavior to the example we are
discussing, it is enough to insert this barrier in both threads:

p := tt;    ||  q := tt;
(wr);       ||  (wr);
r0 := !q    ||  r1 := !p

The threads will issue (wr) before the reads $\mathsf{rd}_{q,\iota_0}$ and $\mathsf{rd}_{p,\iota_1}$. Given the precedence relations we just
assumed as a semantics for (wr), these reads cannot proceed until the barrier has disappeared from
the temporary store. The rule R8 requires, for a barrier to vanish, that it may be commuted with the
previously issued operations. Then in the example above, this can only happen for (wr) once the writes
$\mathsf{wr}_{p,tt}$ and $\mathsf{wr}_{q,tt}$ have been globally performed.
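
The effect of the W→R relaxation and of the barrier on this litmus test can be checked by brute force on a toy machine. The sketch below is only an approximation of the framework: it uses one store buffer per thread (a standard operational picture of write buffering) instead of the temporary store, and it models the barrier as "wait until the issuing thread's buffer is empty". The class and method names are hypothetical. Thread t writes tt to reference t (0 for p, 1 for q) and then reads the other reference into its register.

```java
import java.util.Set;
import java.util.TreeSet;

/** Exhaustive exploration of the two-thread litmus test on a toy store-buffer machine. */
final class StoreBufferLitmus {
    // pc[t]: 0 = write not issued, 1 = at the optional fence, 2 = at the read, 3 = done.

    /** Returns every reachable final pair of register values, e.g. "r0=ff,r1=tt". */
    static Set<String> outcomes(boolean fenced) {
        Set<String> finals = new TreeSet<>();
        explore(new int[2], new boolean[2], new boolean[2], new boolean[2], fenced, finals);
        return finals;
    }

    private static void explore(int[] pc, boolean[] mem, boolean[] buffered, boolean[] reg,
                                boolean fenced, Set<String> finals) {
        boolean moved = false;
        for (int t = 0; t < 2; t++) {
            if (buffered[t]) { // the buffered write of thread t reaches shared memory
                boolean[] mem2 = mem.clone();
                boolean[] buffered2 = buffered.clone();
                mem2[t] = true;
                buffered2[t] = false;
                explore(pc, mem2, buffered2, reg, fenced, finals);
                moved = true;
            }
            int[] pc2 = pc.clone();
            pc2[t] = pc[t] + 1;
            if (pc[t] == 0) { // issue the write into the thread's own buffer
                boolean[] buffered2 = buffered.clone();
                buffered2[t] = true;
                explore(pc2, mem, buffered2, reg, fenced, finals);
                moved = true;
            } else if (pc[t] == 1 && (!fenced || !buffered[t])) { // a fence waits for an empty buffer
                explore(pc2, mem, buffered, reg, fenced, finals);
                moved = true;
            } else if (pc[t] == 2) { // read the other reference; the own buffer never holds it
                boolean[] reg2 = reg.clone();
                reg2[t] = mem[1 - t];
                explore(pc2, mem, buffered, reg2, fenced, finals);
                moved = true;
            }
        }
        if (!moved) { // both threads are done and both buffers are empty
            finals.add("r0=" + name(reg[0]) + ",r1=" + name(reg[1]));
        }
    }

    private static String name(boolean value) {
        return value ? "tt" : "ff";
    }

    public static void main(String[] args) {
        System.out.println("relaxed: " + outcomes(false));
        System.out.println("fenced:  " + outcomes(true));
    }
}
```

Without the fence the exploration finds four final register assignments, including the one where both registers hold ff; with the fence in both threads that outcome disappears and the three sequentially consistent ones remain.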

We can deal in a similar way with the relaxation of the order W→W, which when added to the
previous relaxation characterizes the PSO memory model. And similarly with R→R and R→W, which
are sufficient to characterize the RMO model as described in [2]. In each case a corresponding local
barrier, (ww), (rr) or (rw), can be used to restore sequential consistency.

4.2 Early Reads and Writes 

In this subsection, and the following one, we are concerned with architectures relaxing the atomicity
of writes. There are several examples to illustrate the write early rule R5, in combination with R2, to
show the ability for a thread to "read-own-write-early" or "read-others'-write-early", according to the
terminology of [2], that is the ability for a thread to read a write that has been previously issued, either
by the thread itself or by a foreign thread, before the write updates the shared memory. An example of
the first, which holds in TSO models, is as follows:

p := tt;          ||  q := tt;
r0 := !p; (tt)    ||  r2 := !q; (tt)
r1 := !q  (ff)    ||  r3 := !p  (ff)

where the unexpected outcome is indicated by the annotations (tt) and (ff) associated with the
assignments. Let us assume that the write grain $\mathcal{W}$ contains two sets $W_0$ and $W_1$ such that $t_0 \in W_0$ and
$t_1 \in W_1$. Then it is easy to see that from this thread system we can, using the write early rule R5, reach
a configuration where the temporary store is $\sigma_0 \cdot \sigma_1$ where

\[ \sigma_0 = (t_0, \mathsf{wr}^{W_0}_{p,tt}) \cdot (t_0, \mathsf{rd}_{p,\iota_0}) \cdot (t_0, \mathsf{wr}_{r_0,\iota_0}) \cdot (t_0, \mathsf{rd}_{q,\iota_1}) \]

\[ \sigma_1 = (t_1, \mathsf{wr}^{W_1}_{q,tt}) \cdot (t_1, \mathsf{rd}_{q,\iota_2}) \cdot (t_1, \mathsf{wr}_{r_2,\iota_2}) \cdot (t_1, \mathsf{rd}_{p,\iota_3}) \]

Then by R2 both $\iota_0$ and $\iota_2$ can take the value tt, whereas, given that the order W→R is relaxed (and that
a read mark does not have precedence over a read), both $\iota_1$ and $\iota_3$ take the value ff from the shared store,
before it is updated by performing the writes $\mathsf{wr}^{W_0}_{p,tt}$ and $\mathsf{wr}^{W_1}_{q,tt}$.

As regards the read-others'-write-early ability, the best known litmus test is IRIW (Independent
Reads of Independent Writes):

p := tt  ||  q := tt  ||  r0 := !p; (tt)  ||  r2 := !q; (tt)
                          r1 := !q  (ff)  ||  r3 := !p  (ff)

In our framework, this example is accounted for in the following way. Assume that $\mathcal{W}$ contains two sets
$W_0$ and $W_1$ such that $\{t_0, t_2\} \subseteq W_0$ and $\{t_1, t_3\} \subseteq W_1$, with $t_3 \notin W_0$ and $t_2 \notin W_1$. Then we have, using
R5 twice:

\[ (S, \varepsilon, T) \xrightarrow{*} (S, (t_0, \mathsf{wr}^{W_0}_{p,tt}) \cdot (t_2, \mathsf{rd}_{p,\iota_0}) \cdot (t_1, \mathsf{wr}^{W_1}_{q,tt}) \cdot (t_3, \mathsf{rd}_{q,\iota_2}), T') \]

Now since the write of p is made visible to thread $t_2$, the identifier $\iota_0$ can take the value tt, and similarly
$\iota_2$ takes the value tt, by the rule R2. Since the writes from $t_0$ and $t_1$ are not visible from $t_3$ and $t_2$
respectively, these threads may read the value ff from the shared memory for both q and p. One finally
reaches a state where $S'(r_0) = tt = S'(r_2)$ whereas $S'(r_1) = ff = S'(r_3)$. Notice that in this computation
we never have to "commute" operations (the precedence relation could be anything here), that is, this
computation proceeds in program order, and therefore inserting local barriers in $t_2$ and $t_3$ would not
influence it. Similar examples that are discussed in [6, 14], such as WRC, RWC and CC, can be explained
in the same way. This is the case for instance of WRC (Write-to-Read Causality), without fence since,
as with IRIW, we follow the program order here:

p := tt  ||  r0 := !p; (tt)  ||  r1 := !q; (tt)
             q := tt         ||  r2 := !p  (ff)

Here the write (p := tt) is issued, and, with some appropriate assumption about the write grain, made
visible to the second thread (but not to the third), which will then assign the value tt to $r_0$. Then the
write (q := tt) is globally performed, and, before the operation $\mathsf{wr}^{W}_{p,tt}$ reaches the store, the third thread
is executed, reading the values tt for q in ($r_1$ := !q) and ff for p in ($r_2$ := !p). That is, the outcome
$S'(r_0) = tt = S'(r_1)$ and $S'(r_2) = ff$ is allowed.
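
The role of the write grain in the IRIW computation described earlier in this subsection can be replayed with a few lines of Java. This is a simplified sketch, not the paper's semantics: a pending write carries a visibility set, a read by a thread in that set takes the pending value, and any other read goes to the shared memory. The class `WriteGrainReplay` and its methods are hypothetical names, and the schedule is fixed by hand rather than explored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Replays one IRIW schedule with writes that are visible early to some threads only. */
final class WriteGrainReplay {
    /** A write that has been issued but has not yet reached the shared memory. */
    static final class PendingWrite {
        final String ref;
        final boolean value;
        final Set<String> visibleTo;

        PendingWrite(String ref, boolean value, Set<String> visibleTo) {
            this.ref = ref;
            this.value = value;
            this.visibleTo = visibleTo;
        }
    }

    private final Map<String, Boolean> memory = new HashMap<>();
    private final List<PendingWrite> pending = new ArrayList<>();

    void issueWrite(String ref, boolean value, Set<String> visibleTo) {
        pending.add(new PendingWrite(ref, value, visibleTo));
    }

    /** Reads the most recent pending write visible to the thread, else the shared memory (default ff). */
    boolean read(String thread, String ref) {
        for (int i = pending.size() - 1; i >= 0; i--) {
            PendingWrite w = pending.get(i);
            if (w.ref.equals(ref) && w.visibleTo.contains(thread)) {
                return w.value;
            }
        }
        return memory.getOrDefault(ref, false);
    }

    /** Performs all pending writes globally, in issue order. */
    void performAll() {
        for (PendingWrite w : pending) {
            memory.put(w.ref, w.value);
        }
        pending.clear();
    }

    /** Returns the final values of r0, r1, r2, r3 for the schedule described in the text. */
    static boolean[] iriw() {
        WriteGrainReplay m = new WriteGrainReplay();
        m.issueWrite("p", true, Set.of("t0", "t2")); // W0 contains t0 and t2 but not t3
        m.issueWrite("q", true, Set.of("t1", "t3")); // W1 contains t1 and t3 but not t2
        boolean r0 = m.read("t2", "p"); // early read of the pending write on p
        boolean r2 = m.read("t3", "q"); // early read of the pending write on q
        boolean r1 = m.read("t2", "q"); // not visible to t2: read from shared memory
        boolean r3 = m.read("t3", "p"); // not visible to t3: read from shared memory
        m.performAll();
        return new boolean[] { r0, r1, r2, r3 };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(iriw())); // [true, false, true, false]
    }
}
```

Every operation is taken in program order, which matches the remark that no commutation is needed for this outcome.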



4.3 Global Barriers 

In a model that enables the read-others'-write-early capability, one needs in the language some barrier
having a global effect on writes, that is, a barrier that is prevented from vanishing by writes from foreign
threads. We shall use here the case of PowerPC, as described by [14], to exemplify the framework.
Indeed, the PowerPC architecture offers such a strong sync barrier, which requires the program order to
be preserved between any pair of (local) reads and writes. This means that it enjoys the same precedence
relations as (wr), (ww), (rr) and (rw). The global effect of sync is the one suggested above: sync
maintains the order between two writes, the first one being a visible write from a foreign thread, and the
second being a local write. Then to specify the semantics of this barrier we just have to add the following:

\[ t' \in W \;\Rightarrow\; (t, \mathsf{wr}^{W,I}_{p,v}) \;\mathcal{P}_{\mathsf{sync}}\; (t', \mathsf{sync}) \]

The PowerPC architecture also provides an lwsync barrier, which is weaker than sync. First, this is a
(ww), (rw) and (rr) barrier, but it does not order the pairs of writes and reads, to preserve some TSO
optimizations. Therefore, we cannot define the semantics of lwsync by means of a binary precedence
relation, as we did up to now. Nevertheless, the following precedences are part of the semantics of
lwsync in our framework:

\[ (t, \mathsf{rd}_{\iota}) \;\mathcal{P}_{\mathsf{lw}}\; (t, \mathsf{lwsync}) \quad \& \quad (t, \mathsf{rd}_{q,\iota}) \;\mathcal{P}_{\mathsf{lw}}\; (t, \mathsf{lwsync}) \;\mathcal{P}_{\mathsf{lw}}\; (t, \mathsf{wr}^{W,I}_{p,v}) \]
\[ t = t' \ \text{or}\ t' \in W \;\Rightarrow\; (t, \mathsf{wr}^{W,I}_{p,v}) \;\mathcal{P}_{\mathsf{lw}}\; (t', \mathsf{lwsync}) \]

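Next, we have to say that lwsync is a (rr) barrier, even though it does not have precedence over reads.
Then we assume that the commutability predicate satisfies the following:

\[
\left.
\begin{array}{l}
\sigma = \sigma_0 \cdot (t, \mathsf{lwsync}) \cdot \sigma_1 \ \& \\
\sigma_0 = \delta_0 \cdot (t, \mathsf{rd}_{q,\iota'}) \cdot \delta_1 \ \text{or}\ \sigma_0 = \delta_0 \cdot (t, \mathsf{rd}_{\iota'}) \cdot \delta_1
\end{array}
\right\}
\;\Rightarrow\; \neg(\sigma \curlyvee (t, \mathsf{rd}_{p,\iota}))
\]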



This completes the definition of the semantics of lwsync. Let us see two examples. If we insert global
barriers into the IRIW configuration, as follows:

p := tt  ||  q := tt  ||  r0 := !p; (tt)  ||  r2 := !q; (tt)
                          lwsync;         ||  sync;
                          r1 := !q  (ff)  ||  r3 := !p  (ff)

then the unexpected outcome is still not prevented. This is obtained as follows: the operation of
the second thread ($t_1$) is issued, and then the ones of $t_3$, $t_0$ and $t_2$, in that order. Then the visibility of
$(t_0, \mathsf{wr}_{p,tt})$ is made global, and therefore $t_2$ can read the value tt for p. Since the write from $t_0$ is allowed
to be performed immediately, the read mark left when performing $\mathsf{rd}_{p,\iota_0}$ may disappear. The lwsync from
$t_2$ is still prevented from vanishing by the write from $t_0$, but it no longer blocks the second read of $t_2$.

In the case of the WRC litmus test [6], inserting lwsync barriers prevents the unexpected outcome
shown in

p := tt  ||  r0 := !p; (tt)  ||  r1 := !q; (tt)
             lwsync;         ||  lwsync;
             q := tt         ||  r2 := !p  (ff)

from occurring. Similarly, inserting sync barriers in the third and fourth threads of the IRIW example
restores an SC behavior. To see this, we have to explore all the possible behaviors, and this is where our
software tool is useful.



5 The Simulator 



The set of configurations that may be reached by running a program in the relaxed semantics can be
fairly large, and it is sometimes difficult, and error prone, to find a path to some (un)expected final
state, or to convince oneself that such an outcome is actually forbidden, that is, unreachable. Then, to
experiment with our framework, we found it useful to design and implement a simulator that allows us to
exhaustively explore all the possible relaxed behaviors of (simple) programs. As usual, we have to face
a state explosion problem, which is much worse than with the standard interleaving semantics.

Our simulator is written in JAVA. Its main function step computes all the configurations reachable
in one step from a given configuration. A brute force simulator would then recursively use the step
function, in a depth first manner, in order to compute reachable configurations that have an empty
temporary store and a terminated thread pool, where all the thread expressions are values. This methodology
does not consume much memory space, being basically proportional to the log of the number of
reachable states or, similarly, to the depth of the tree induced by the step function. However, the number of
configurations in this tree grows very fast with the size of the expression to analyse. For instance, with
the example given in the Introduction, this brute force strategy has been aborted after generating more
than $20 \times 10^{10}$ configurations and after half a day of computing, even though it is obvious that only four
different final configurations may be reached. Therefore, a first improvement is to replace the tree traversal
by a DAG construction that merges identical configurations. Fewer configurations are then constructed and
analyzed (only 60588 for the example), but all these configurations must be simultaneously in memory.
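
The difference between the brute force tree traversal and the DAG construction can be seen on a toy step function. The sketch below does not reproduce the simulator: the `step` function here is a hypothetical stand-in in which a configuration "i,j" only records how many of their n steps two threads have performed. The two exploration strategies are the ones described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Compares a tree traversal with a DAG construction on a toy one-step function. */
final class ExplorationSketch {
    /** Toy step function: either thread may perform its next step while it has some left. */
    static List<String> step(String config, int n) {
        String[] parts = config.split(",");
        int i = Integer.parseInt(parts[0]);
        int j = Integer.parseInt(parts[1]);
        List<String> next = new ArrayList<>();
        if (i < n) {
            next.add((i + 1) + "," + j);
        }
        if (j < n) {
            next.add(i + "," + (j + 1));
        }
        return next;
    }

    /** Depth-first tree traversal: a configuration is generated once per path leading to it. */
    static long treeSize(String config, int n) {
        long count = 1;
        for (String next : step(config, n)) {
            count += treeSize(next, n);
        }
        return count;
    }

    /** DAG construction: identical configurations are merged, but all of them stay in memory. */
    static int dagSize(String initial, int n) {
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        seen.add(initial);
        work.push(initial);
        while (!work.isEmpty()) {
            for (String next : step(work.pop(), n)) {
                if (seen.add(next)) {
                    work.push(next);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 8;
        System.out.println("tree: " + treeSize("0,0", n) + " configurations generated");
        System.out.println("dag:  " + dagSize("0,0", n) + " distinct configurations");
    }
}
```

For two threads of 8 steps each, the tree traversal generates 48619 configurations while the DAG holds only 81 distinct ones. The price, as noted above, is that the `seen` set keeps every configuration in memory.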

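Several other optimizations have been used. In order to reduce the search space, in the simulator we
use a refined rule R5 where the visibility set $W'$ is supposed to be either Tid or a subset of
$\mathit{live}(T) \cup \mathit{rdt}(\sigma)$ where the sets $\mathit{live}(T)$ and $\mathit{rdt}(\sigma)$ of thread identifiers are defined as follows:

\[
\begin{array}{ll}
\mathit{live}(\emptyset) = \emptyset & \mathit{rdt}(\varepsilon) = \emptyset \\
\mathit{live}((t,e) \parallel T) = \mathit{live}(T) \cup \{t \mid e \notin \mathit{Val}\} & \mathit{rdt}((t,\xi) \cdot \sigma) = \mathit{rdt}(\sigma) \cup \{t \mid \exists q, \iota.\ \xi = \mathsf{rd}_{q,\iota}\}
\end{array}
\]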

We have not presented this formulation in Figure 3 only because it is conceptually a bit more obscure.
With this optimization, in our example, the number of configurations falls from 60588 to 51068.
A more dramatic optimization is obtained by introducing a distinction between "registers," that are local
to some thread, and shared references. As suggested above, the registers are denoted $r_j$ in the examples.
Indeed, these registers are not concerned by early reads from foreign threads, and therefore applications
of the rule R5 to them may be drastically restricted. In this way, the number of generated configurations
in the case of the example given in the Introduction decreases from 51068 to 13356. Furthermore, one may
observe that, since removing an operation from a temporary store $\sigma$ never depends on what follows this
operation in $\sigma$, the strategy that consists in applying first the rules of Figure 2 for evaluating the
threads before attempting anything else (that is, applying a rule from Figure 3) will never miss any final
configuration. This allows us to generate only 2814 configurations in the case of the same example.
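
The restriction of the visibility sets used in the refined rule R5 can be sketched as follows. This is an illustration under assumptions, not the simulator's code: the thread pool is reduced to a map telling whether each thread's expression is already a value, the temporary store to a list of {thread, kind} pairs where the kind "rd" marks a pending read, and live(T) is read as the set of threads that have not terminated. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Enumerates the candidate visibility sets: Tid, or any subset of live(T) union rdt(sigma). */
final class VisibilityCandidates {
    /** Threads whose expression is not yet a value (terminated maps a thread to "is a value"). */
    static Set<String> live(Map<String, Boolean> terminated) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Boolean> entry : terminated.entrySet()) {
            if (!entry.getValue()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Threads that have a pending read in the temporary store. */
    static Set<String> rdt(List<String[]> sigma) {
        Set<String> result = new TreeSet<>();
        for (String[] op : sigma) {
            if (op[1].equals("rd")) {
                result.add(op[0]);
            }
        }
        return result;
    }

    /** All subsets of live(T) union rdt(sigma), plus the full set Tid. */
    static Set<Set<String>> candidates(Set<String> tid, Map<String, Boolean> terminated,
                                       List<String[]> sigma) {
        Set<String> union = new TreeSet<>(live(terminated));
        union.addAll(rdt(sigma));
        List<String> base = new ArrayList<>(union);
        Set<Set<String>> result = new LinkedHashSet<>();
        for (int mask = 0; mask < (1 << base.size()); mask++) {
            Set<String> subset = new TreeSet<>();
            for (int k = 0; k < base.size(); k++) {
                if ((mask & (1 << k)) != 0) {
                    subset.add(base.get(k));
                }
            }
            result.add(subset);
        }
        result.add(new TreeSet<>(tid));
        return result;
    }
}
```

With four threads of which only one is still running and one other has a pending read, this yields 5 candidate sets instead of the 16 subsets of Tid, which is the kind of reduction the refined rule aims at.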

However, the optimized search strategy outlined above still fails to explore exhaustively some
complex litmus tests. In such cases, we make a tradeoff between time and space: for each temporary store
that can be reached by applying the rules of Figure 2 as far as possible, we generate the reachable final
configurations, but we do not share this state space among the various possible temporary stores. For
instance, still regarding the example given in the Introduction, there are 20 possible "maximal" temporary
stores, and running the simulator independently in each case generates an average number of 500
configurations, so that the total number of generated configurations following this simulation method
rises to 10280. Nevertheless this allowed us to successfully explore a large number of litmus tests, and in
particular all the ones presented by Sarkar et al. [14] in their web files. We report upon this in the full
version of the paper. Our simulator is available on the web page http://www-sop.inria.fr/indes/MemoryModels/.

6 Conclusion

We have introduced a new, operational way to formalize the relaxed semantics of concurrent programs. 
Our model is flexible enough to account for a wide variety of weak behaviors, and in particular the 
odd ones occurring in a memory model that does not preserve the atomicity of writes. In our view, our
model is also simple enough to be easily understood by the implementer and the programmer, and precise
enough to be used in the formal analysis of programs.